ABSTRACT

Summarizing work that is part of an Army research
program on Methodological Issues in the Construction of Criterion
Referenced Tests, this paper focuses on a Bayesian model
which gives the probability of correctly classifying an examinee as a
master or as a nonmaster while taking into consideration the test
length and the mastery cut-off score. Bayes' Theorem is a
mathematical expression which allows the combination of information
about the quality of the examinee population so as to produce a
probabilistic estimate of mastery for a specific examinee. This
approach can give the most accurate ability estimate for each
examinee by using the fewest number of test items, provided that
accurate estimates of the "quality parameters" have been made. A
method of estimating these parameters from commonly available
information is also explained. (Author/BW)






A Bayesian Method for Maximizing 
Correct Mastery Classifications 

Frederick Steinheiser, Jr., PhD
Army Research Institute for the 
Behavioral and Social Sciences 
Arlington, Virginia 22209 



Introduction



This paper summarizes work which is part of an on-going research program
at the Army Research Institute. The program, called Met test
(Methodological Issues in the construction of Criterion-Referenced Tests),
is exploring and developing psychometric models for defining test
standards and test lengths. The focus of this paper is upon a "Bayesian"
model, which gives the probability of correctly classifying an examinee
as a master or as a nonmaster while taking into consideration the test
length and the mastery cut-off score.

Personnel assessment is typically made through the development and 
administration of tests, and the evaluation of test scores. The final 
desired output of a test for a given examinee is information which allows 
us to pinpoint his ability to do whatever is required by an objective.
That is, we observe a test score and must then infer the ability of the
examinee.

In Norm-Referenced Testing, an examinee's score is evaluated with 
respect to his position among all of the other scores in the examinee 
population. But in Criterion-Referenced Testing his score is evaluated 
with respect to his passing or not passing a particular instructional
objective, independent of the scores obtained by others in the examinee
population. A passing score indicates that he is a "master" of that
particular instructional objective, and a failing score indicates that he
is a "nonmaster." 

Ideally, if an examinee's score on a test is above the minimal passing 
standard, he would be correctly classified as having "mastery ability." 
Ability assessment would therefore be based upon 100% correct
classifications. But since we live in a less than ideal world, there will be
variability due to the imperfections in the test construction, psychological
variability (forgetting, guessing, individual differences) in the examinee
population, and other unknown sources of error. Hence, sometimes a person
who is really not a master (in the ideal world) will be classified as a
master on the basis of his test score; and sometimes a person who really
is a master will be unfortunate enough to be classified as a nonmaster.
The following chart illustrates these four classification outcomes.



The views expressed in this paper are those of the 
author and do not necessarily imply endorsement 
by the U.S. Army. 




                                   True Ability State:
                                Master            Nonmaster

Classification     Master       True positive     False positive
Based Upon
Test Score         Nonmaster    False negative    True negative

In order to approach the ideal classification accuracy, the probability of
a True positive should be much greater than that for a False positive, and
the probability for a True negative should be much greater than that for a
False negative. The classification problem has now been cast into a
decision-making framework, for which "Bayes' Theorem" may be used: we
wish to obtain the probability of an examinee being in the Mastery Ability
state given (conditional upon) his test score. Symbolically, this is
expressed as p(M1|T), where M1 refers to the mastery state, and T is the
test score of that specific individual.

An Example Using Bayes' Theorem 

Bayes' Theorem is a mathematical expression which allows us to combine
information about the quality of the examinee population so as to
produce a probabilistic estimate of mastery for a specific examinee.
This approach will give the most accurate ability estimate for each
examinee by using the fewest number of test items, provided that accurate
estimates of the "quality parameters" have been made. (A subsequent
portion of this paper will show exactly how to estimate these quality
parameters from data similar to that which you might have.) First, let's
take a look at the mathematical expression, the estimates that we have to
feed into it, and the output that it gives us:

$$p(M_1 \mid T) = \frac{p(T \mid M_1)\,p(M_1)}{p(T \mid M_1)\,p(M_1) + p(T \mid M_2)\,p(M_2)}$$

Here we assume that the 2 states of nature (master and nonmaster) are 
mutually exclusive and collectively exhaustive, and that T is the test 
score which is observed. We also assume that the test is dichotomously 
scored. A correct response is denoted "1", an incorrect response is 
denoted "0" and the total test score is simply the number of correct 
responses. What we seek to find is the term on the left, the probability
that a given student is a master, given his test score. In
order to find it, we need to have an estimate of the prior probability of
mastery, p(M1), in the population of students from which this student was
drawn. The prior probability of mastery can be thought of as the
proportion of students in the examinee population we think are masters. For
example, if our instruction were very good the prior probability of
mastery would be high, and most of the students who completed the
instruction should have mastered the objective. The actual number specified
for the prior probability of mastery may be an informed guess based on
experience or it may be based on the empirical results of tests given to
previous classes of similar students.

We must also estimate the conditional probability of a certain test
score given that the student who got that score was a master. For example,
if only one item were administered, the conditional probability of a score
of one correct given that the student was a master is simply the
probability that a master responds correctly. We may estimate this
conditional probability empirically based on previous student groups, or we
may provide a best guess as to how well masters perform, or this conditional
probability may reflect a minimal standard of achievement. We shall show
how p(M|T) will vary as a function of the prior expectations of the
tester, the number of test items, and the conditional probabilities, p(T|M),
after an example to illustrate the computations.

Suppose that a student chosen at random from a trainee population was
given a criterion-referenced test, and that he passed the test. Given the
results of the test, what is the probability that the student is indeed a
master of that particular course of instruction? In order to calculate
the probability, we obtain the following information from the educational
expert who administered the CRT: The probability that a master would
obtain a passing score = .90, (p(T|M1) = .90); the probability that a
nonmaster would obtain a passing score = .05, (p(T|M2) = .05); and the
prior probability of randomly selecting a master from this trainee
population is equal to .70, that is, we believe that 70% of this and similar
previous trainee populations may be assumed to be composed of masters.
Substituting these values into the formula:

$$p(M_1 \mid T) = \frac{.9 \times .7}{.9 \times .7 + .05 \times .3}$$

which equals .977. Hence, before the test score was available, the
probability that this student was a master was .70, but after a passing
score was observed, the probability that this person is a master has
increased to .977. (The probability of this student's being a nonmaster,
given the same passing score, p(M2|T), would be equal to 1 - .977 or .023.)
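
The arithmetic of this example can be checked with a few lines of Python. This is only an illustrative sketch: the function name `posterior_mastery` and its argument names are assumptions introduced here and are not part of the original paper.

```python
def posterior_mastery(p_pass_given_master, p_pass_given_nonmaster, prior_master):
    """Two-state Bayes' Theorem: p(M1|T) for an observed passing score T."""
    prior_nonmaster = 1.0 - prior_master  # the two states are exclusive and exhaustive
    numerator = p_pass_given_master * prior_master
    denominator = numerator + p_pass_given_nonmaster * prior_nonmaster
    return numerator / denominator


# Values from the worked example: p(T|M1) = .90, p(T|M2) = .05, p(M1) = .70
p_m1 = posterior_mastery(0.90, 0.05, 0.70)
p_m2 = 1.0 - p_m1
print(round(p_m1, 3), round(p_m2, 3))  # 0.977 0.023
```

Changing only the prior in the call shows how strongly the result depends on the assumed proportion of masters.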

Methods for Estimating "Quality Parameters" (Prior and Conditional 
Probabilities) 

This model assumes that background information about students who 
previously took the test is available. This background information 
should lead to the accurate estimation of parameters that describe the 
quality of the examinee population. We need information to estimate the 
prior probability of a randomly selected student being in one of the
assumed mastery ability groups. We also need to be able to estimate 
the conditional probability that an examinee from a particular ability 
group would get an item right. For purposes of illustration, let's 
assume that 100 examinees produced the following distribution of scores 
on a five item test: 

Number of items     Frequency     p (Correct)
correct

      5                30             1.0
      4                30              .8
      3                 0              .6
      2                10              .4
      1                20              .2
      0                10             0.0



For this particular set of data, it seems reasonable to postulate two 
ability groups. Note that 60 examinees got either 4 or 5 test items 
correct, and that a total of 40 examinees got 0, 1, or 2 items correct. 
No one got 3 items correct. Hence, this bimodal distribution of scores 
strongly suggests that we may set the prior probability of mastery equal 
to .6, and the prior probability of nonmastery equal to .4. Symbolically, 
p(M1) = .6, and p(M2) = .4.

We also need to estimate the conditional probability that a correct
response is made to an item, given the particular mastery or ability
state. Symbolically, we seek p(x=1|M). There are several ways to
compute this parameter value. Unfortunately, each method produces a
different value.

We could take the average proportion correct for the mastery group.
For the present data, this would produce

$$\frac{30 \times 1.0 + 30 \times .8}{60} = .9$$

We could also take the lowest score in the mastery group, which in 
this case is .8. 

We could also take the desired (or "standard") score required for
the demonstration of mastery, which need not necessarily be observed.
This value for the present data could be set at .70, or .71, or .82, etc.

This variety of estimated values should not be distressing, since it 
allows the examiner to introduce his own requirements into the selection. 
The important thing is that his choice should be close to at least one 
of the logically derived estimates. 
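
As a minimal sketch of these estimates, the short Python script below recomputes the priors and the first two conditional-probability estimates from the score distribution given above. The variable names, and the choice of scores 4 and 5 as the mastery group, are assumptions made for illustration only.

```python
# Score distribution from the five item example: items correct -> frequency
frequencies = {5: 30, 4: 30, 3: 0, 2: 10, 1: 20, 0: 10}
n_items = 5
mastery_scores = (4, 5)  # assumption: scores of 4 or 5 define the mastery group

total = sum(frequencies.values())
n_masters = sum(frequencies[s] for s in mastery_scores)

prior_master = n_masters / total       # p(M1)
prior_nonmaster = 1.0 - prior_master   # p(M2)

# Method 1: average proportion correct within the mastery group
avg_correct = sum(frequencies[s] * s / n_items for s in mastery_scores) / n_masters

# Method 2: lowest proportion correct actually observed in the mastery group
lowest_correct = min(s for s in mastery_scores if frequencies[s] > 0) / n_items

print(prior_master, prior_nonmaster, avg_correct, lowest_correct)
```

The third method, a desired standard, is a judgment of the examiner and is therefore not computed from the data.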

The Bayesian Mathematical Model



In order to generalize the Bayesian approach to a wide variety of
applications in personnel assessment, two additions must be made to the
previously described formula. These additions are the number of trials
or items on the test (N), and the number of hypothesized mastery ability
states (r). The derivation of the general formula to meet these goals
was originally presented by Hershman (1971):



p(Mi|T) 



In this formula, p(tj|Mi) equals the conditional probability of a person
in the ith mastery state getting the jth test item correct; p(Mi) is the
prior probability of the representation of the ith mastery state in the
student population (the % of students who are estimated to be in the ith
mastery state); and p(Mi|T) is the conditional probability of a particular
student being classified as being in the ith mastery state given his
total test score. A computational example showing how the formula is
applied for three mastery states is given in the Appendix.
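
As a sketch only, one form of the general formula that is consistent with the definitions above can be written as follows. It assumes that items are answered independently within a mastery state. The response indicator $x_j$ (1 if item j is answered correctly, 0 otherwise), the index $r$ for the number of mastery states, and the shorthand $\theta_i$ for the equal-difficulty case are notation introduced here for illustration; they are not taken from Hershman (1971).

```latex
p(M_i \mid T) =
  \frac{p(M_i) \prod_{j=1}^{N} p(t_j \mid M_i)^{x_j}
        \bigl[1 - p(t_j \mid M_i)\bigr]^{1 - x_j}}
       {\sum_{k=1}^{r} p(M_k) \prod_{j=1}^{N} p(t_j \mid M_k)^{x_j}
        \bigl[1 - p(t_j \mid M_k)\bigr]^{1 - x_j}}

% Equal item difficulty: p(t_j | M_i) = \theta_i for every item j,
% with T items correct out of N
p(M_i \mid T) =
  \frac{p(M_i)\, \theta_i^{T} (1 - \theta_i)^{N - T}}
       {\sum_{k=1}^{r} p(M_k)\, \theta_k^{T} (1 - \theta_k)^{N - T}}
```

With two states and a single pass/fail observation, this expression reduces to the two-state formula used in the earlier example.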

This formula was not used to analyze any "real" test score data. 
Rather, selected values of the various parameters were systematically 
manipulated in order to determine their influence on the probability of 
mastery classification. Hence, the results are from a computer simulation
of idealized data and should serve to emphasize the relative effects
of each of the parameters. 

The two parameters that estimate the quality of the examinee
population, the prior probability of selecting a master from the population,
and the conditional probability of a known master and of a known
nonmaster getting a randomly selected item correct, have already been
discussed. Basically, p(M1) reflects the proportion of masters in a group
of examinees (and the level of training), whereas p(1|M1) and p(1|M2)
will be high if the test is easy and lower if the test is difficult. The
ideal criterion referenced test should provide a high probability for the
former, and low probability for the latter.

Two other parameters, the minimal passing standard (the per cent of
all the items that were answered correctly) and the test length, are
interrelated. By way of analogy, consider the minimal passing standard
for deciding that a coin is biased to be 70%. That is, if heads (or
tails, for that matter) come up on 70% of the tosses, we would evaluate
the coin as being unfair. Note that the 70% figure is arbitrary. We
could have set the standard at 65%, or 75%, or 80%, etc. Now how many
tosses (test items) do we want to observe? 10? 50? 100? 1,000? If
we observe 7 heads out of ten tosses, and 700 heads out of 1,000 tosses,
does the probability of the fairness of the coin remain the same? It
should be intuitively obvious (and it can be easily demonstrated by means
of the binomial distribution) that the minimal passing standard interacts
with test length. The probability that a coin is fair when 7 heads out
of ten tosses are observed is much greater than when 700 heads out of
1,000 tosses are observed, even though the "70%" criterion was strictly
maintained! Values of 60%, 70%, and 80% correct were used in the current
simulation. The number of trials or items (N) took on values of 5, 10,
20, and 40.
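
One way to make the coin analogy concrete is to ask how often a fair coin would produce at least 70% heads. The sketch below uses the exact binomial distribution for that tail probability; the function name `prob_at_least` is an assumption introduced here, and the tail probability is offered only as an illustration of the interaction, not as the quantity plotted in the figures.

```python
from math import comb


def prob_at_least(heads, tosses):
    """P(X >= heads) for a fair coin, using the exact binomial distribution."""
    favorable = sum(comb(tosses, k) for k in range(heads, tosses + 1))
    return favorable / 2 ** tosses  # exact integer arithmetic until the final division


p_short = prob_at_least(7, 10)      # 70% heads in 10 tosses
p_long = prob_at_least(700, 1000)   # 70% heads in 1,000 tosses
print(p_short, p_long)
```

The first probability is of ordinary size, while the second is vanishingly small, so the same 70% result is far stronger evidence against fairness on the long "test".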

The final parameter which was manipulated in the model is the number 
of assumed mastery states. It may be overly simplistic to assume that 
the world is divided into just two mutually exclusive
states of mastery and nonmastery. Perhaps there are varying degrees of
mastery, ranging from "complete" to "partial" to "totally incomplete."
The present model is able to handle any number of assumed mastery states. 

The model makes the following two important assumptions concerning
the nature of the test from which the data are derived: (1) The test
measures a unidimensional latent trait or unitary skill; (2) Test items
or trials are equally difficult for a given ability. An elaboration of
the basic model can easily be made to include test items with varying 
degrees of difficulty. 

Changes in p(M|T) Assuming Two Mastery States

The fundamental purpose of the present study was to investigate how
the probability of mastery classification changes as a function of the
simultaneous manipulation of up to four parameters (independent variables).
The scope of the study is not exhaustive, since only several values of
each parameter were used. However, some general trends do seem to emerge,
as can be seen in the following figures.

Figures 1, 2, and 3 show the results of applying the model to a
situation in which only two mastery groups (mastery and nonmastery)
have been hypothesized. The data points represent the probability that
a trainee is a master, given (conditional upon) his total test score,
p(M|T). The curvature of each line shows how p(M|T) changes as a
function of variations in the prior expectation of mastery, the % correct
items observed, the conditional probabilities of both a master and a
nonmaster responding correctly to an item, and the number of items
comprising the test.

Figure 1 represents a testing situation in which the training was of
extremely high quality, since the proportion of masters in the trainee
population was assumed to equal 0.9. That is, p(M1) = 0.9. Figure 1A
portrays the situation in which both masters and nonmasters have attained
a rather high degree of proficiency, since the probability of a master
responding correctly to any given item is 0.9, and the probability of a
nonmaster responding correctly is 0.6. If a person scored 80% on a five
item test, the probability that he is a master is approximately .91. This
probability drops to .65 if a 60% score on five items (3 out of 5 correct)
were obtained. Note that when the test length is increased to 40 items,
an 80% score (32 correct) produces a .99 probability of mastery.
However, a score of 60% (24 correct) yields an essentially zero probability
of mastery. The effect of the test length variable on classification
accuracy is dramatic: if p(M|T) had to be at least 0.5 for a person
to be called a master, then scores of 60% on a five item test would lead
to mastery classification. But a 60% score on a 40 item test would lead
to nonmastery classification.
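
The Figure 1A values quoted above can be approximated with the following sketch. It assumes equally difficult, independent items, so that the likelihood of a score is binomial (the binomial coefficient cancels between numerator and denominator). The function name `posterior` and its arguments are hypothetical names chosen for this illustration.

```python
def posterior(priors, p_correct, n_correct, n_items):
    """Posterior probability of each mastery state given n_correct out of n_items.

    Assumes equally difficult, independent items within each state.
    """
    weights = [
        prior * p ** n_correct * (1.0 - p) ** (n_items - n_correct)
        for prior, p in zip(priors, p_correct)
    ]
    total = sum(weights)
    return [w / total for w in weights]


# Figure 1A parameters: p(M1) = .9, p(M2) = .1, p(1|M1) = .9, p(1|M2) = .6
priors, p_correct = [0.9, 0.1], [0.9, 0.6]
for n_correct, n_items in [(4, 5), (3, 5), (32, 40), (24, 40)]:
    p_master = posterior(priors, p_correct, n_correct, n_items)[0]
    print(f"{n_correct}/{n_items}: p(M1|T) = {p_master:.3f}")
```

Under these assumptions the computed values fall close to those read from the graph: roughly .92 and .65 for the five item test, about .98 for 32 of 40 correct, and essentially zero for 24 of 40 correct.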

Figure 1A also illustrates the effect of "prior beliefs" on p(M|T).
Intuitively, one might suppose that the chances were much higher that a
person who obtained a score of 60% (even from a 5 item test) came from a
population whose probability of correctly answering an item was 0.6 than
from a population whose probability of answering an item correctly was
0.9. However, the relative proportions of the two groups (expressed as
prior belief in mastery and nonmastery, or p(M1) = .9 and p(M2) = .1,
respectively) are such that the probability of a person being in the
mastery state is approximately 0.65 for a score of 3 correct (60%) on a
5 item test. Only by increasing the number of test items can the strong
prior bias in favor of the mastery decision be reversed. Figures 2A and
3A show what happens when prior beliefs are not so heavily biased in
favor of mastery. In neither case is the probability of being in the
mastery state above 0.5 for scores of less than 80%. But Figure 1A
suggests that when prior beliefs heavily favor one group over the other,
longer length tests should be used. Otherwise, the amount of data may
not be sufficient to force a change in the originally held prior beliefs.

The effect of changing the prior beliefs concerning the proportion
of masters and nonmasters in the examinee population while holding all
other parameters constant can be seen by comparing corresponding graphs
A, B, C, and D in Figures 1, 2, and 3. As the prior beliefs approach
equiprobability (where p(M1) = p(M2) = 0.5), more items are required to
maintain a given level of confidence that a person is either a master or
nonmaster. The inability to postulate strong prior beliefs must be
compensated for by increasing the test length in order to maintain a
constant classification accuracy.

The effect of changing the probability of a correct response,
p(1|Mi), can be seen by comparing graphs A, B, C, and D for Figures 1, 2,
and 3. For example, the only difference between Figure 1A and Figure 1B
is that p(1|M1) changes from 0.9 to 0.8, all other parameters being
held constant. (This change might reflect a lower level of required
proficiency, and hence less training, for Graph B than for A. Or perhaps
previous test results indicate that masters of the instruction respond to
items with a probability of correct response equal to 0.8 rather than 0.9.)
In any case, the effect of this small change in p(1|M1) on p(M|T)
is readily apparent. For any test length or observed test score, the
probability of being in the mastery state is greater in Graph B than in A.
This shift is most obvious for the 70% observed correct curve. Notice
that p(M|T) on Graph A for an observed score of 70% (28 out of 40 correct)
is approximately 0.04. However, the value for p(M|T) in Graph B for 70%
of a 40 item test correct is 0.87.
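
The size of this shift can be reproduced with a short sketch that works in terms of odds: the posterior odds of mastery equal the prior odds multiplied by the likelihood ratio of the observed score. The function name `p_master` and the equal-difficulty, independent-item assumption are introduced here for illustration.

```python
def p_master(prior, p_m, p_nm, n_correct, n_items):
    """p(M1|T) via posterior odds = prior odds x likelihood ratio."""
    n_wrong = n_items - n_correct
    likelihood_ratio = (p_m / p_nm) ** n_correct * ((1 - p_m) / (1 - p_nm)) ** n_wrong
    odds = prior / (1 - prior) * likelihood_ratio
    return odds / (1 + odds)


graph_a = p_master(0.9, 0.9, 0.6, 28, 40)  # Graph A: p(1|M1) = 0.9
graph_b = p_master(0.9, 0.8, 0.6, 28, 40)  # Graph B: p(1|M1) = 0.8
print(round(graph_a, 2), round(graph_b, 2))
```

Under these assumptions the two calls give values of about 0.04 and 0.87, in line with the figures quoted for 28 correct out of 40.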

The main reason for this abrupt change from Graph A to B (in Figures
1, 2, and 3) is the lowered requirement for mastery, from 0.9 to 0.8. The
probability that "0.9 persons" score only 70% correct on long tests is
relatively low. But when masters are defined as those trainees who come
from a population with a probability of responding correctly equal to 0.8,
the probability of their scoring 70% on a long test is high. One of the
most difficult jobs for an instructional designer is to describe the
level of capability required of graduates and the level of capability
actually achieved. Comparison of these graphs indicates the magnitude of
the effect that these specifications can have on the classification of
trainees.

Graphs C and D of Figures 1, 2, and 3 further illustrate the effect 
of variations in the probability of correct responses. The only differ- 
ence between Graphs B and C is that the probability of a correct response 
from a nonmaster decreases from 0.6 to 0.5. The effect of this decrease
in correct response probability from a nonmaster is to lower the
likelihood of a nonmaster achieving a test score of at least 70%, which also
increases the probability that a person achieving a high % score is in 
the mastery state. Finally, Graph D portrays an extreme case in which 
neither masters nor nonmasters are responding at particularly high levels. 
However, the level of performance for nonmasters is so low (0.4), that 
even for observed scores of 60% the probability of being in the mastery 
state exceeds 0.8 for all test lengths, except for 5 and 10 items in 
Figure 2, and 5, 10, and 20 items in Figure 3. 

Further detailed analysis of these figures is not included in this
paper. In comparing the twelve graphs against each other, note the
magnitude of the changes in p(M|T) when small changes have been made in the
prior beliefs, in the correct response probabilities, and in the percent
correct observed responses. The implication is that extreme care must be
taken when specifying parameters in a Bayesian approach to testing and
decision making. If the parameters are realistic, great savings in
testing time and expense, and increased confidence in decision making are
possible (Novick & Lewis, 1974). However, if the parameters are not
realistic, there is a very real danger of misclassifying many examinees.
The next section of this paper deals with an elaboration of the model to
three mastery states, thus helping to quantify sources of classification
error.

Elaboration to Three Mastery States 

Figures 4, 5, 6, and 7 represent cases for which three mastery states
have been hypothesized. In Figures 4 and 6 the probability of a correct
response for a person assumed to be in mastery state M1 equals 0.8; for
mastery state M2 this probability is 0.6, and for mastery state M3 it is
0.5. These values could correspond to the situation in which the
nonmastery group was divided in half. That is, those persons whose
probability of getting any given item correct is 0.5 (comprising mastery
state M3) would need extensive retraining; whereas those whose probability
is 0.6 (comprising mastery state M2) would merely need selective retraining.
People in mastery state M1 have a probability of 0.8 for making a correct
response, and may therefore be considered as "masters" who have
successfully passed training.

For Figures 5 and 7 the corresponding probabilities of a correct
response for people in mastery states M1, M2 and M3 are 0.9, 0.8, and
0.6, respectively. These probabilities might describe a situation in
which the mastery group was dichotomized, perhaps in an attempt to
identify those students who had achieved an exceptionally high level
of proficiency, i.e., p(1|M1) = 0.9.

In Figures 4 and 5 the prior probabilities (or assumed proportions)
of examinees in each mastery state are: p(M1) = 0.5, p(M2) = 0.3, and
p(M3) = 0.2. In Figures 6 and 7 the corresponding prior probabilities
are 0.25, 0.50, and 0.25, respectively. The prior values in Figures 4
and 5 display a bias towards higher levels of mastery (50% of the
examinees are assumed to be type M1 masters), whereas the bias in Figures
6 and 7 is towards the intermediate level of mastery (50% of the
examinees are assumed to be type M2 masters).

A detailed analysis of Figures 4 and 5 will provide the basis for
an interpretation of Figures 6 and 7, which is an exercise left to the
reader. The three graphs labeled A, B, and C represent the probability
that an individual is in mastery state M1, M2, and M3, respectively.

Graph A shows the probability that an individual is in mastery state
M1 given observed scores of 60%, 70%, and 80% correct on 5, 10, 20, and
40 item tests. Thus, for an observed score of 4 out of 5 correct, the
probability that this person is in mastery state M1 is about 0.65. But
if this same person got a score of 32 out of 40 (still 80% correct), the
probability that he is an M1 master jumps to 0.98. These results are
similar to those obtained when two mastery groups were hypothesized,
and again illustrate the effect of increasing test length on the level
of confidence in the mastery classification p(M|T).

The probability of being in mastery state M2 given observed scores
is plotted in Graph B. If a person got 4 out of 5 correct, the
probability of being in state M2 is about 0.25. However, if he got 32 out
of 40 correct (still 80% correct), this probability plummets to 0.02.
Finally, using these same test score values, Graph C shows that the
probability of being a type M3 master is 0.10 for 4 out of 5 correct, and
nearly zero for 32 out of 40 correct. This result makes intuitive sense,
because there is only 20% of type M3 (non)masters in the examinee
population, and the probability of their getting any item correct is only
0.50, which is a long way from 80% observed correct.
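
The three-state values for Figure 4 can be approximated with the sketch below, which extends the same calculation to any number of states and works in log space so that long tests do not underflow. The function name `state_posteriors` and the equal-difficulty, independent-item assumption are introduced here for illustration only.

```python
from math import exp, log


def state_posteriors(priors, p_correct, n_correct, n_items):
    """Posterior over any number of mastery states, computed in log space."""
    log_w = [
        log(prior) + n_correct * log(p) + (n_items - n_correct) * log(1 - p)
        for prior, p in zip(priors, p_correct)
    ]
    top = max(log_w)
    weights = [exp(v - top) for v in log_w]  # rescale before exponentiating
    total = sum(weights)
    return [w / total for w in weights]


# Figure 4 parameters: priors .5, .3, .2 and p(correct) .8, .6, .5 for M1, M2, M3
priors, p_correct = [0.5, 0.3, 0.2], [0.8, 0.6, 0.5]
short = state_posteriors(priors, p_correct, 4, 5)
long_ = state_posteriors(priors, p_correct, 32, 40)
print([round(v, 2) for v in short])
print([round(v, 2) for v in long_])
```

Each list holds the probabilities for M1, M2, and M3 in that order, and each list sums to 1.0.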

Notice that for any given test length and percent correct, the sum
of the probabilities of being in states M1, M2, and M3 equals 1.0.
Comparison of Graphs A, B, and C shows that when either 70% or 80% of the
items for any test length are correctly answered, the probability of being
in state M1 is greater than the probability of being in either state M2
or M3. That is, both the 70% and 80% curves are higher in Graph A than
in either Graph B or C. For an observed score of 60% the probability of
being in state M2 is greater than for M1 or M3. The probability of being
in state M3 is rather low for all values of test length and percent
correct observed in this particular example.

In Figure 5 the interrelationship between test length and three
hypothesized mastery states becomes even more apparent. For example,
Graph A shows that the probability of being in state M1 for 80% correct
on a 5 item test is about 0.48. The probability of being in state M2
(shown in Graph B) for 80% correct on a five item test is about 0.36.
There is thus a greater chance that a person whose score is 4 out of 5
is in M1 (p(M1|T) = 0.48), instead of M2 (p(M2|T) = 0.36) or M3 (p(M3|T)
= 0.16). However, if a score of 80% correct were observed on a 40 item
test, the graphs indicate that a much different decision would be
appropriate. In this case, p(M1|T) equals 0.21, p(M2|T) = .78, and p(M3|T) =
0.01. Hence, people scoring 32 out of 40 correct should be classified
as type M2 masters. Also note that a score of 60% for any test length
implies that these people should be placed in the M3 state.
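
The reversal described for Figure 5 can be sketched as a simple decision rule that assigns each examinee to the most probable state. The function name `classify`, the state labels, and the equal-difficulty, independent-item assumption are illustrative choices made here, not part of the original model description.

```python
def classify(priors, p_correct, n_correct, n_items, labels=("M1", "M2", "M3")):
    """Return (most probable state, posteriors) under a binomial likelihood."""
    weights = [
        prior * p ** n_correct * (1 - p) ** (n_items - n_correct)
        for prior, p in zip(priors, p_correct)
    ]
    total = sum(weights)
    posteriors = [w / total for w in weights]
    best = max(range(len(posteriors)), key=posteriors.__getitem__)
    return labels[best], posteriors


# Figure 5 parameters: priors .5, .3, .2 and p(correct) .9, .8, .6
priors, p_correct = [0.5, 0.3, 0.2], [0.9, 0.8, 0.6]
for n_correct, n_items in [(4, 5), (32, 40), (3, 5), (24, 40)]:
    state, post = classify(priors, p_correct, n_correct, n_items)
    print(n_correct, n_items, state, [round(v, 2) for v in post])
```

With these parameters the rule selects M1 for 4 of 5 correct, M2 for 32 of 40 correct, and M3 for 60% correct on either test length.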

For the data used in Figure 5, the probability of finding M1 type
masters is overall quite low. Instead, for the levels of achievement
demonstrated by obtained scores of 60%, 70%, or 80%, it is more likely
that such scores were produced by people in mastery states M2 (p(1|M2) =
0.8) and M3 (p(1|M3) = 0.6).

Test Length and Misclassification Error



One of the most important questions that must be answered in
designing a training evaluation program is: What is the probability of
falsely classifying a person on the basis of a given observed score? It is
also possible to turn the question around and ask: How long must a test be,
and what score is required for classification decisions to be made with
some specified lower limit of misclassification?

Figures 8 and 9 demonstrate how the Bayesian model can be used to
answer the above questions. Assuming that the prior and conditional
probabilities are realistic and fixed, the important variables are then
test length and cutting score. Suppose that p(M1) = 0.9, p(M2) = 0.1,
p(1|M1) = 0.9, and p(1|M2) = 0.6 as in Figure 1A. In this example, the
prior belief that an untested trainee is a master is very high, p(M1) =
0.9. A reasonable question might therefore be: What score must be
observed such that a nonmastery decision can be made with at least 90%
confidence? In other words, what data are required to force a reversal
in the prior belief?

To be 90% confident of a nonmastery decision, p(M2|T) must be equal
to at least 0.90. Since the sum of p(M1|T) and p(M2|T) equals 1.0,
p(M1|T) must therefore not be greater than 0.10. Referring to Figure 8,
a horizontal line crossing the ordinate at 0.10 can be drawn. This line
crosses the curve for a five item test at a point corresponding to 26%
correct. The next lowest possible test score is one correct (20%), so
the decision rule is that all persons scoring one correct or less should
be considered nonmasters. The point on the ordinate corresponding to
20% correct on the five item test is about 0.05. Hence, the final
decision rule states that nonmastery decisions based on an observed
score of one correct out of five can be made with 95% confidence (1.00
- 0.05 = 0.95). For observed scores lower than the cutoff score the
confidence in making a correct decision must increase. Continuing with
the present example, p(M1|T) if zero correct are observed is
virtually equal to zero. Hence, those persons who get no items right may
be classified as M2 type nonmasters with nearly 100% confidence.
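
The search for a cutting score can be sketched in code rather than read from a graph. The function names `p_master` and `nonmastery_cutoff`, and the equal-difficulty, independent-item assumption, are introduced here for illustration only.

```python
def p_master(prior, p_m, p_nm, n_correct, n_items):
    """p(M1|T) for two states with equally difficult, independent items."""
    n_wrong = n_items - n_correct
    w_m = prior * p_m ** n_correct * (1 - p_m) ** n_wrong
    w_nm = (1 - prior) * p_nm ** n_correct * (1 - p_nm) ** n_wrong
    return w_m / (w_m + w_nm)


def nonmastery_cutoff(prior, p_m, p_nm, n_items, confidence=0.90):
    """Highest score at which p(M2|T) is at least `confidence`, or None."""
    for score in range(n_items, -1, -1):
        if 1 - p_master(prior, p_m, p_nm, score, n_items) >= confidence:
            return score
    return None


# Parameters of the example: p(M1) = .9, p(1|M1) = .9, p(1|M2) = .6, five items
cutoff = nonmastery_cutoff(0.9, 0.9, 0.6, 5)
print(cutoff, round(p_master(0.9, 0.9, 0.6, cutoff, 5), 2))
```

For the five item test this search returns a cutting score of one correct, with p(M1|T) of about 0.05 at that score, which matches the decision rule derived from Figure 8.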

A similar analysis applied to the 40 item test curve indicates that 
the cutting score should be about 73% correct. The next lowest possible 
score to 73% is 70%, which yields exactly 28 correct out of 40 items. 
The probability of mastery given an observed score of 28 correct is about 
0.04. At such a low value of p(M1|T) the chances for misclassification 
using a five item test and a 40 item test are almost the same. However, 
the observed percent correct at which the nonmastery decision is made for 
the two tests is 20% on the five item test and 70% on the 40 item test. 
Superficially, two tests of different lengths would seem to produce the 
same decision outcome, and longer tests would not seem really necessary 
for reducing classification error. 

In order to appreciate the benefits gained by using longer 
tests, the entire curve must be examined. Note that at 80% correct the 
five item test yields p(M1|T) equal to 0.92. This result suggests that, 
on the average, 8% of the mastery decisions will be in error. For the 
40 item test, the probability of mastery given 80% correct is about 0.99. 
That is, there is only about a 1% chance of misclassification error. A 
test that distinguishes sharply between masters and nonmasters is one in 
which the probability of mastery is close to either 0.0 or 1.0 for most 
obtained scores. On such tests there is only a small region in which 
classification error is large. For example, in Figure 8 for the 40 item 
test, the region where p(M1|T) is greater than 0.1 and less than 0.9 
extends from 71% to 77% correct. This means that the probability of 
misclassifying a person will exceed 0.10 only when observed scores range 
from 71% to 77% correct. In contrast, the region of the five item test 
curve for which p(M1|T) is greater than 0.10 and less than 0.9 extends 
from about 26% to about 79%. Hence, there is a much larger region of the 
curve for which the probability of misclassification exceeds 0.10. 
Obviously, if classification accuracy is to be maximized over the entire 
range of possible test scores, then longer tests are required. Ideally, 
a very long test would produce a step function, for which p(M1|T) at all 
possible scores approaches either 0.0 or 1.0. 
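
The comparison of the two regions can be sketched for the attainable scores only. This assumes the same independent-item model and the example values p(M1) = 0.9, p(1|M1) = 0.9, and p(1|M2) = 0.6; the function names are illustrative. The percentages quoted in the text are read from continuous curves, so the discrete scores listed here fall inside those ranges rather than at their endpoints.

```python
def posterior_mastery(k, n, prior_m1=0.9, p_right_m1=0.9, p_right_m2=0.6):
    """Return p(M1|T) for k items right out of n (two states, independent items)."""
    joint_m1 = prior_m1 * p_right_m1**k * (1 - p_right_m1) ** (n - k)
    joint_m2 = (1 - prior_m1) * p_right_m2**k * (1 - p_right_m2) ** (n - k)
    return joint_m1 / (joint_m1 + joint_m2)


def uncertain_scores(n, low=0.10, high=0.90):
    """Scores k for which p(M1|T) lies strictly between low and high."""
    return [k for k in range(n + 1) if low < posterior_mastery(k, n) < high]


for n in (5, 40):
    scores = uncertain_scores(n)
    share = len(scores) / (n + 1)  # fraction of the n + 1 possible scores
    percents = [100 * k / n for k in scores]
    print(f"{n} items: uncertain scores {scores} ({percents} % correct), "
          f"{share:.0%} of possible scores")
```

With these values two scores fall in the uncertain region on each test, but they are two of six possible scores on the five item test and two of 41 on the 40 item test.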

Figure 9 can be analyzed in a manner similar to that for Figure 8. 
However, Figure 9 has one outstanding characteristic that merits special 
attention. If nonmastery decisions must be made with 90% confidence, 
and a horizontal line at p(M1|T) = 0.1 is drawn, the line does not 
intersect the curve for the five item test. This means that it is not 
possible to classify a nonmaster with 90% confidence if a five item test 
is used, given the parameters used in Figure 9. If resource or time 
constraints are such that no more than five items may be given, and if 
the parameter values used in Figure 9 are realistic, and if 90% 
confidence is required for mastery decisions, then there is no reason to 
test. Testing is irrelevant because no matter what score is observed, 
including zero correct, the decision rule compels a mastery decision to 
be made. In fact, for the present values, the probability of mastery 
given zero correct is equal to 0.21. This simply means that if persons 
obtaining a score of zero are classified as nonmasters, 21% of them will 
be misclassified, on the average. 
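
Whether a test of a given length can ever support a nonmastery decision may be checked by looking at the lowest attainable posterior, which occurs at zero correct. The sketch below assumes the independent-item, two-state model. The parameter values in it are hypothetical and chosen only for illustration; they are not taken from Figure 9, and the function names are not from the paper.

```python
def posterior_at_zero(n, prior_m1, p_right_m1, p_right_m2):
    """p(M1|T) when all n items are wrong: the lowest posterior the test can give."""
    joint_m1 = prior_m1 * (1 - p_right_m1) ** n
    joint_m2 = (1 - prior_m1) * (1 - p_right_m2) ** n
    return joint_m1 / (joint_m1 + joint_m2)


def min_informative_length(confidence, prior_m1, p_right_m1, p_right_m2,
                           max_items=200):
    """Shortest test on which zero correct gives p(M2|T) >= confidence, else None."""
    limit = 1.0 - confidence
    for n in range(1, max_items + 1):
        if posterior_at_zero(n, prior_m1, p_right_m1, p_right_m2) <= limit:
            return n
    return None


# Hypothetical values for illustration only (not the Figure 9 parameters).
params = dict(prior_m1=0.9, p_right_m1=0.8, p_right_m2=0.6)

floor_five = posterior_at_zero(5, **params)
shortest = min_informative_length(0.90, **params)
print(f"lowest p(M1|T) on five items: {floor_five:.3f}")
print("shortest test allowing a 90% nonmastery decision:", shortest)
```

For these hypothetical values the five item floor stays above 0.10, so no score can justify a nonmastery decision at 90% confidence, and a test of at least seven items would be needed before testing could change the decision.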






Conclusions 



The results of this simulation experiment stress the practical 
importance of test length, criterion scores, and accurate estimates of 
examinee quality in making optimal mastery classifications. If the 
assumptions of the model have been met, and if accurate parameter 
estimates have been made, then a Bayesian method is optimal in the sense 
of providing more accurate mastery classifications with the fewest test 
items. 

In the typical situation for personnel assessment, the examiner will 
have some degree of control over the values of the parameters. His 
estimates of the prior probability of mastery will depend upon the 
goodness of the information he can obtain about previous examinee 
populations' scores. His estimates of the conditional probabilities can 
be made by several equally justifiable and logical procedures. In any 
case, informed subjective judgment is absolutely essential. 

The criterion for minimal mastery, expressed as some percent correct 
of the total number of test items, is explicitly under the examiner's 
control. In some testing situations, he may deem 70% correct as a 
minimal passing score, whereas when more critical skills are involved, 
he may want to observe at least 80% correct before calling an examinee a 
master. But as the model demonstrates, test length interacts with the 
percent correct required for "mastery" decisions. Specifically, as test 
length increases, classification accuracy increases, even when the same 
percent correct is maintained. In performance-based tests, for example, 
where the cost of each item could be very high (such as field artillery 
or tank gunnery), the examiner is obliged to use the minimum number of 
trials, and so the minimal percent correct mastery criterion should be 
increased accordingly. Finally, the model has demonstrated that testing 
may be irrelevant in making mastery classification decisions if test 
length does not exceed some minimal number of items. 

Appendix: A Computational Example for Three Mastery States 

The following example illustrates the computations necessary for 
processing data with the Bayesian model. The values chosen for this 
example correspond to Figure 4. Assume that there are three states of 
mastery, and unequal prior probabilities for these three states. The 
educational decision-maker must provide estimates for the prior 
probabilities of mastery, p(Mi). For this example let us assume the 
values to be: p(M1) = .5; p(M2) = .3; and p(M3) = .2. He must also 
provide estimates for the conditional probability of getting any given 
test item right, given each mastery state. The following values will be 
used as the conditional probability of getting an item right given a 
mastery state: p(1|M1) = .8; p(1|M2) = .6; p(1|M3) = .5. The conditional 
probabilities of getting an item wrong given a mastery state are: 
p(0|M1) = .2; p(0|M2) = .4; and p(0|M3) = .5. 

First we need to calculate the probability that an item is answered 
correctly. For the overall population, 

$$p(t_j = \text{correct}) = \sum_{i=1}^{3} p(M_i)\, p(t_j = \text{correct} \mid M_i) = (.5)(.8) + (.3)(.6) + (.2)(.5) = .68.$$

Likewise, 

$$p(t_j = \text{wrong}) = \sum_{i=1}^{3} p(M_i)\, p(t_j = \text{wrong} \mid M_i) = (.5)(.2) + (.3)(.4) + (.2)(.5) = .32.$$
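
The two marginal probabilities can be reproduced with a short sketch. The dictionary names are illustrative; the numbers are the priors and conditional probabilities stated for this example.

```python
# Prior probabilities p(Mi) and conditional probabilities p(1|Mi) from the example
priors = {"M1": 0.5, "M2": 0.3, "M3": 0.2}
p_right = {"M1": 0.8, "M2": 0.6, "M3": 0.5}

# Marginal probability that a single item is answered correctly or wrongly
p_correct = sum(priors[m] * p_right[m] for m in priors)
p_wrong = sum(priors[m] * (1 - p_right[m]) for m in priors)

print(f"p(tj = correct) = {p_correct:.2f}")
print(f"p(tj = wrong)   = {p_wrong:.2f}")
```

The two values sum to 1.0, as they must for a single item scored right or wrong.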

We also need to obtain the set of conditional probabilities for the 
different mastery states given that an individual item was responded to 
either correctly or wrongly. The general equation is: 

$$p(M_i \mid t_j) = \frac{p(M_i)\, p(t_j \mid M_i)}{p(t_j)}$$

Substituting the above values yields: p(M1|tj = correct) = (.5)(.8) 
÷ .68 = .588; p(M2|tj = correct) = (.3)(.6) ÷ .68 = .265; and p(M3|tj = 
correct) = (.2)(.5) ÷ .68 = .147. (Note that the sum equals 1.0.) Finally, 
p(M1|tj = wrong) = (.5)(.2) ÷ .32 = .3125; p(M2|tj = wrong) = (.3)(.4) ÷ 
.32 = .375; and p(M3|tj = wrong) = (.2)(.5) ÷ .32 = .3125. If 6 items were 
answered correctly on a 10 item criterion-referenced test, the following 
values result: 

$$\prod_{j=1}^{10} p(M_i \mid t_j): \quad M_1 = 3.9 \times 10^{-4}; \quad M_2 = 6.8 \times 10^{-6}; \quad M_3 = 9.6 \times 10^{-8}.$$
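
A minimal sketch of these per-item posteriors and their products follows, assuming 6 items right and 4 wrong and using unrounded intermediate values. The variable names are illustrative. Because nothing is rounded along the way, the products agree with the printed figures only to about two significant digits.

```python
priors = {"M1": 0.5, "M2": 0.3, "M3": 0.2}
p_right = {"M1": 0.8, "M2": 0.6, "M3": 0.5}
n_right, n_wrong = 6, 4  # 6 correct on a 10 item test

p_correct = sum(priors[m] * p_right[m] for m in priors)
p_wrong = sum(priors[m] * (1 - p_right[m]) for m in priors)

# Per-item posteriors p(Mi|tj) for a correct and for a wrong response
post_right = {m: priors[m] * p_right[m] / p_correct for m in priors}
post_wrong = {m: priors[m] * (1 - p_right[m]) / p_wrong for m in priors}

# Product of the per-item posteriors over all 10 responses
products = {m: post_right[m] ** n_right * post_wrong[m] ** n_wrong
            for m in priors}

for m in priors:
    print(f"{m}: p(Mi|right) = {post_right[m]:.3f}, "
          f"p(Mi|wrong) = {post_wrong[m]:.4f}, product = {products[m]:.2e}")
```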

Finally, the general Bayesian formula yields the conditional 
probability for each mastery state given the total test score. For 
example, 

$$p(M_1 \mid T) = \frac{\dfrac{3.9 \times 10^{-4}}{(.5)^{9}}}{\dfrac{3.9 \times 10^{-4}}{(.5)^{9}} + \dfrac{6.8 \times 10^{-6}}{(.3)^{9}} + \dfrac{9.6 \times 10^{-8}}{(.2)^{9}}}$$

Similar calculations yield p(M2|T) = .473 and p(M3|T) = .254. 
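
The final step can be sketched in two ways that should agree: the route used in this appendix, which divides each product of per-item posteriors by the prior raised to the ninth power before normalizing, and a direct application of Bayes' Theorem to the whole response pattern. Both assume independent items, and the variable names are illustrative. With unrounded arithmetic the results come out near .275, .469, and .256, which differ from the printed values in the third decimal place; this is presumably an effect of rounding the intermediate quantities, though that is an assumption.

```python
priors = {"M1": 0.5, "M2": 0.3, "M3": 0.2}
p_right = {"M1": 0.8, "M2": 0.6, "M3": 0.5}
n_items, n_right = 10, 6
n_wrong = n_items - n_right

# Route 1: direct Bayes on the full response pattern
joint = {m: priors[m] * p_right[m] ** n_right * (1 - p_right[m]) ** n_wrong
         for m in priors}
direct = {m: joint[m] / sum(joint.values()) for m in priors}

# Route 2: product of per-item posteriors, divided by prior^(n - 1), then normalized
p_correct = sum(priors[m] * p_right[m] for m in priors)
p_wrong = sum(priors[m] * (1 - p_right[m]) for m in priors)
product = {m: (priors[m] * p_right[m] / p_correct) ** n_right
              * (priors[m] * (1 - p_right[m]) / p_wrong) ** n_wrong
           for m in priors}
scaled = {m: product[m] / priors[m] ** (n_items - 1) for m in priors}
appendix = {m: scaled[m] / sum(scaled.values()) for m in priors}

for m in priors:
    print(f"p({m}|T): direct = {direct[m]:.3f}, appendix route = {appendix[m]:.3f}")
```

The division by the prior raised to the ninth power is needed because the product of ten per-item posteriors counts the prior ten times, while the posterior for the whole test should count it once.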